Entity Annotation and Sequence Labeling Tool

2024-07-07 18:36 | Source: compiled from the web

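My research area is knowledge extraction in NLP, and my entity extraction experiments require annotated training data. The complete data and code are available on GitHub.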

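I first use jieba to segment the original text and tag parts of speech (POS). Based on the POS tags, I extract company names, securities, fund names, and similar entities from the text (regular expressions would also work for this step) and save them to word_dict.txt as a dictionary. The original text is then annotated against that dictionary. In word_dict.txt, each line holds an entry followed by its label, INT or BON. The last line, 占位词 NONE (a placeholder entry), must be present; it serves as the dictionary's stop keyword. word_dict.txt looks like this: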

启迪设计集团股份有限公司 INT
北京光环新网科技股份有限公司 INT
周口市综合投资有限公司 INT
上海汉得信息技术股份有限公司 INT
湖南湘江新区投资集团有限公司 INT
融信福建投资集团有限公司 INT
湖南尔康制药股份有限公司 INT
厦门灿坤实业股份有限公司 INT
中融国证钢铁行业指数分级证券投资基金 BON
华中证空天一体军工指数证券投资基金 BON
富国新兴成长量化精选混合型证券投资基金 BON
江西省政府一般债券 BON
占位词 NONE
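The regular-expression route mentioned above can be sketched as follows. This is a minimal illustration, not the method used in this post: the suffix list `SUFFIX_TAGS` and the helper names `extract_entries` and `to_dict_lines` are hypothetical, and a suffix pattern like this also captures any Chinese characters that directly precede a name, so POS tagging will usually give cleaner boundaries.

```python
import re

# Hypothetical suffix -> label table, chosen only for illustration.
SUFFIX_TAGS = [
    ("证券投资基金", "BON"),
    ("有限公司", "INT"),
]


def extract_entries(text):
    """Collect candidate dictionary entries as a name -> label mapping."""
    entries = {}
    for suffix, tag in SUFFIX_TAGS:
        # A lazy run of CJK characters ending in the given suffix.
        pattern = re.compile(r"[\u4e00-\u9fff]+?" + re.escape(suffix))
        for match in pattern.finditer(text):
            # Keep the first label seen for a name.
            entries.setdefault(match.group(), tag)
    return entries


def to_dict_lines(entries):
    """Format entries as word_dict.txt lines, ending with the placeholder line."""
    lines = [f"{name} {tag}" for name, tag in entries.items()]
    lines.append("占位词 NONE")
    return lines


if __name__ == "__main__":
    sample = "启迪设计集团股份有限公司，中融国证钢铁行业指数分级证券投资基金。"
    for dict_line in to_dict_lines(extract_entries(sample)):
        print(dict_line)
```

Because names are delimited here only by punctuation, the extracted entries are worth reviewing by hand before they are used for annotation.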

Annotation source code:

# -*- coding: utf-8 -*-
'''
Annotate data in BIO format using an external dictionary.
Author: 西兰
Date: 2019-8-26
'''

# Build the keyword list and the keyword -> tag dictionary
# (keyword as key, tag as value) from word_dict.txt.
features_list = []
tag_dict = {}
with open('./data/word_dict.txt', 'r', encoding='utf-8') as f:
    for line in f:
        item = line.split()
        if len(item) > 1:
            features_list.append(item[0])
            tag_dict[item[0]] = item[1]
        elif line.strip():
            # Malformed line (no tag): log it instead of using it as a keyword.
            with open('./data/error.txt', 'a', encoding='utf-8') as err_f:
                err_f.write(line.strip() + "\n")

# Automatic annotation: each dictionary key is searched for in the unlabeled
# text, and every match is labeled with the tag stored as its value.
file_input = './data/dev_unlabel.txt'
file_output = './cut_data/dev_labeled.txt'

with open(file_input, 'r', encoding='utf-8') as f, \
        open(file_output, 'a', encoding='utf-8') as output_f:
    for line in f:
        # Search the stripped text so match positions line up with tag_list.
        text = line.strip()
        word_list = list(text)
        tag_list = ["O"] * len(word_list)

        for keyword in features_list:
            index_log = 0
            while True:
                index_start_tag = text.find(keyword, index_log)
                # No further match: move on to the next keyword.
                if index_start_tag == -1:
                    break
                index_log = index_start_tag + 1
                index_end_tag = index_start_tag + len(keyword)
                # Annotate only spans that are still entirely untagged, to
                # avoid nested or partially overlapping annotations.
                if all(t == "O" for t in tag_list[index_start_tag:index_end_tag]):
                    tag_list[index_start_tag] = "B-" + tag_dict[keyword]  # first character
                    for i in range(index_start_tag + 1, index_end_tag):
                        tag_list[i] = "I-" + tag_dict[keyword]  # remaining characters

        for w, t in zip(word_list, tag_list):
            # Skip whitespace characters (spaces, tabs).
            if not w.isspace():
                output_f.write(w + " " + t + '\n')
        # A blank line separates sentences.
        output_f.write('\n')
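For experimenting without the data files, the same dictionary-matching idea can be packaged as a small in-memory function. This is only a sketch under assumptions: the function name `bio_tag` is hypothetical, and unlike the script above it tries longer keywords first, so that a short keyword cannot claim part of a longer entity.

```python
def bio_tag(text, entries):
    """Return (character, tag) pairs for `text`.

    `entries` maps keyword -> label, e.g. {"江西省政府一般债券": "BON"}.
    """
    tags = ["O"] * len(text)
    # Longest keywords first (an assumption of this sketch, not of the script above).
    for keyword in sorted(entries, key=len, reverse=True):
        if not keyword:
            continue
        start = text.find(keyword)
        while start != -1:
            end = start + len(keyword)
            # Tag the span only if none of its characters is tagged yet.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = "B-" + entries[keyword]
                for i in range(start + 1, end):
                    tags[i] = "I-" + entries[keyword]
            start = text.find(keyword, start + 1)
    return list(zip(text, tags))


if __name__ == "__main__":
    demo_entries = {"鹏华基金管理有限公司": "INT", "基金": "BON"}
    for char, tag in bio_tag("鹏华基金管理有限公司申请基金", demo_entries):
        print(char, tag)
```

In the demo, the occurrence of 基金 inside the company name keeps its INT tags, and only the standalone occurrence at the end is tagged BON.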

Data after annotation with the code above:

鹏 B-INT
华 I-INT
基 I-INT
金 I-INT
管 I-INT
理 I-INT
有 I-INT
限 I-INT
公 I-INT
司 I-INT
申 O
请 O
， O
本 B-INT
所 I-INT
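A quick way to sanity-check output like this is to read the character/tag pairs back and reassemble the entities. The sketch below is an illustration rather than part of the original tool: the helper names `parse_bio_lines` and `bio_to_entities` are hypothetical, and it handles one sentence at a time (blank separator lines are simply skipped).

```python
def parse_bio_lines(lines):
    """Turn 'char tag' lines into (char, tag) pairs, skipping blank or malformed lines."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def bio_to_entities(pairs):
    """Rebuild (entity_text, label) tuples from BIO-tagged characters."""
    entities = []
    chars, label = [], None
    for char, tag in pairs:
        if tag.startswith("B-"):
            # A new entity begins; close the previous one, if any.
            if chars:
                entities.append(("".join(chars), label))
            chars, label = [char], tag[2:]
        elif tag.startswith("I-") and chars and tag[2:] == label:
            chars.append(char)
        else:
            # "O", or an I- tag that does not continue an open entity (dropped).
            if chars:
                entities.append(("".join(chars), label))
            chars, label = [], None
    if chars:
        entities.append(("".join(chars), label))
    return entities


if __name__ == "__main__":
    sample_lines = ["本 B-INT", "所 I-INT", "申 O", "请 O"]
    print(bio_to_entities(parse_bio_lines(sample_lines)))
```

Comparing the recovered entities with the dictionary makes unintended matches easy to spot.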

